{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5分钟Python实现TXT和MySQL数据Join"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 一、实现步骤：\n",
    "1. 使用pandas的接口读取数据到dataframe\n",
    "2. 使用pandas的merge实现不同dataframe的join\n",
    "3. 将结果数据输出到csv/excel/mysql"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 语法1：Pandas读取MySQL数据表\n",
    "**read_sql(sql, con)**  \n",
    "Read SQL query or database table into a DataFrame."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 语法2：Pandas读取TXT数据\n",
    "**pd.read_csv(filepath_or_buffer, sep, header)**  \n",
    "Returns a DataFrame\n",
    "* filepath_or_buffer：指定文件路径\n",
    "* sep：字段分隔符\n",
    "* header：指定列名"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 语法3：Pandas的Merge函数实现Join\n",
    "pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,\n",
    "         left_index=False, right_index=False, sort=True,\n",
    "         suffixes=('_x', '_y'), copy=True, indicator=False,\n",
    "         validate=None)\n",
    "* left, right：DataFrame\n",
    "* how：join的方法分为'left', 'right', 'outer', 'inner'\n",
    "* left_on, right_on：Join的Key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 二、实战：实现csv日志文件和MySQL码表的JOIN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1、读取MySQL数据源到Dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pymysql\n",
    "conn = pymysql.connect(\n",
    "        host='127.0.0.1',\n",
    "        user='root',\n",
    "        password='12345678',\n",
    "        database='test',\n",
    "        charset='utf8'\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "mysql_page = pd.read_sql(\"select * from tb_page\", con=conn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1001</td>\n",
       "      <td>首页</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>1002</td>\n",
       "      <td>列表页</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>1003</td>\n",
       "      <td>详情页</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     id name\n",
       "0  1001   首页\n",
       "1  1002  列表页\n",
       "2  1003  详情页"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mysql_page"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2、读取TXT日志数据源到Dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt_log = pd.read_csv(\"./page_pvuv.txt\", sep=\"\\t\", header=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt_log.columns = [\"pdate\", \"page_id\", \"pv\", \"uv\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pdate</th>\n",
       "      <th>page_id</th>\n",
       "      <th>pv</th>\n",
       "      <th>uv</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>2019-09-01</td>\n",
       "      <td>1001</td>\n",
       "      <td>999</td>\n",
       "      <td>99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2019-09-01</td>\n",
       "      <td>1002</td>\n",
       "      <td>888</td>\n",
       "      <td>88</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>2019-09-01</td>\n",
       "      <td>1003</td>\n",
       "      <td>777</td>\n",
       "      <td>77</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>2019-09-02</td>\n",
       "      <td>1001</td>\n",
       "      <td>666</td>\n",
       "      <td>66</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>2019-09-02</td>\n",
       "      <td>1002</td>\n",
       "      <td>555</td>\n",
       "      <td>55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>2019-09-02</td>\n",
       "      <td>1003</td>\n",
       "      <td>444</td>\n",
       "      <td>44</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>2019-09-03</td>\n",
       "      <td>1001</td>\n",
       "      <td>333</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>2019-09-03</td>\n",
       "      <td>1002</td>\n",
       "      <td>222</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>2019-09-03</td>\n",
       "      <td>1003</td>\n",
       "      <td>111</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        pdate  page_id   pv  uv\n",
       "0  2019-09-01     1001  999  99\n",
       "1  2019-09-01     1002  888  88\n",
       "2  2019-09-01     1003  777  77\n",
       "3  2019-09-02     1001  666  66\n",
       "4  2019-09-02     1002  555  55\n",
       "5  2019-09-02     1003  444  44\n",
       "6  2019-09-03     1001  333  33\n",
       "7  2019-09-03     1002  222  22\n",
       "8  2019-09-03     1003  111  11"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "txt_log"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3、实现两个dataframe的Join"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "merge_data = pd.merge(\n",
    "    left=txt_log,\n",
    "    right=mysql_page,\n",
    "    how='inner',\n",
    "    left_on=\"page_id\",\n",
    "    right_on=\"id\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pdate</th>\n",
       "      <th>page_id</th>\n",
       "      <th>pv</th>\n",
       "      <th>uv</th>\n",
       "      <th>id</th>\n",
       "      <th>name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>2019-09-01</td>\n",
       "      <td>1001</td>\n",
       "      <td>999</td>\n",
       "      <td>99</td>\n",
       "      <td>1001</td>\n",
       "      <td>首页</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2019-09-02</td>\n",
       "      <td>1001</td>\n",
       "      <td>666</td>\n",
       "      <td>66</td>\n",
       "      <td>1001</td>\n",
       "      <td>首页</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>2019-09-03</td>\n",
       "      <td>1001</td>\n",
       "      <td>333</td>\n",
       "      <td>33</td>\n",
       "      <td>1001</td>\n",
       "      <td>首页</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>2019-09-01</td>\n",
       "      <td>1002</td>\n",
       "      <td>888</td>\n",
       "      <td>88</td>\n",
       "      <td>1002</td>\n",
       "      <td>列表页</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>2019-09-02</td>\n",
       "      <td>1002</td>\n",
       "      <td>555</td>\n",
       "      <td>55</td>\n",
       "      <td>1002</td>\n",
       "      <td>列表页</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>2019-09-03</td>\n",
       "      <td>1002</td>\n",
       "      <td>222</td>\n",
       "      <td>22</td>\n",
       "      <td>1002</td>\n",
       "      <td>列表页</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>2019-09-01</td>\n",
       "      <td>1003</td>\n",
       "      <td>777</td>\n",
       "      <td>77</td>\n",
       "      <td>1003</td>\n",
       "      <td>详情页</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>2019-09-02</td>\n",
       "      <td>1003</td>\n",
       "      <td>444</td>\n",
       "      <td>44</td>\n",
       "      <td>1003</td>\n",
       "      <td>详情页</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>2019-09-03</td>\n",
       "      <td>1003</td>\n",
       "      <td>111</td>\n",
       "      <td>11</td>\n",
       "      <td>1003</td>\n",
       "      <td>详情页</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        pdate  page_id   pv  uv    id name\n",
       "0  2019-09-01     1001  999  99  1001   首页\n",
       "1  2019-09-02     1001  666  66  1001   首页\n",
       "2  2019-09-03     1001  333  33  1001   首页\n",
       "3  2019-09-01     1002  888  88  1002  列表页\n",
       "4  2019-09-02     1002  555  55  1002  列表页\n",
       "5  2019-09-03     1002  222  22  1002  列表页\n",
       "6  2019-09-01     1003  777  77  1003  详情页\n",
       "7  2019-09-02     1003  444  44  1003  详情页\n",
       "8  2019-09-03     1003  111  11  1003  详情页"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "merge_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4、将结果数据存储到Excel文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "merge_data[['pdate','page_id','name','pv','uv']].to_excel(\"merge_pvuv_page.xlsx\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
